Recommendation Systems with TensorFlow

Introduction

In this lab, we will create a movie recommendation system based on the MovieLens dataset available here. The data consists of movies ratings (on a scale of 1 to 5). Specifically, we'll be using matrix factorization to learn user and movie embeddings. Concepts highlighted here are also available in the course on Recommendation Systems.

Objectives

  1. Explore the MovieLens Data
  2. Train a matrix factorization model
  3. Inspect the Embeddings
  4. Perform Softmax model training

We then download the MovieLens Data, and create DataFrames containing movies, users, and ratings.

Exploring the Movielens Data

Before we dive into model building, let's inspect our MovieLens dataset. It is usually helpful to understand the statistics of the dataset.

Users

We start by printing some basic statistics describing the numeric user features.

We can also print some basic statistics describing the categorical user features

We can also create histograms to further understand the distribution of the users. We use Altair to create an interactive chart.

Next, we look at the distribution of ratings per user. Clicking on an occupation in the right chart will filter the data by that occupation. The corresponding histogram is shown in blue, and superimposed with the histogram for the whole data (in light gray). You can use SHIFT+click to select multiple subsets.

What do you observe, and how might this affect the recommendations?

Movies

It is also useful to look at information about the movies and their ratings.

Finally, the last chart shows the distribution of the number of ratings and average rating.

Preliminaries

Our goal is to factorize the ratings matrix $A$ into the product of a user embedding matrix $U$ and movie embedding matrix $V$, such that $A \approx UV^\top$ with $U = \begin{bmatrix} u_{1} \\ \hline \vdots \\ \hline u_{N} \end{bmatrix}$ and $V = \begin{bmatrix} v_{1} \\ \hline \vdots \\ \hline v_{M} \end{bmatrix}$.

Here

Sparse Representation of the Rating Matrix

The rating matrix could be very large and, in general, most of the entries are unobserved, since a given user will only rate a small subset of movies. For effcient representation, we will use a tf.SparseTensor. A SparseTensor uses three tensors to represent the matrix: tf.SparseTensor(indices, values, dense_shape) represents a tensor, where a value $A_{ij} = a$ is encoded by setting indices[k] = [i, j] and values[k] = a. The last tensor dense_shape is used to specify the shape of the full underlying matrix.

Toy example

Assume we have $2$ users and $4$ movies. Our toy ratings dataframe has three ratings,

user_id movie_id rating
0 0 5.0
0 1 3.0
1 3 1.0

The corresponding rating matrix is

$$ A = \begin{bmatrix} 5.0 & 3.0 & 0 & 0 \\ 0 & 0 & 0 & 1.0 \end{bmatrix} $$

And the SparseTensor representation is,

SparseTensor(
  indices=[[0, 0], [0, 1], [1,3]],
  values=[5.0, 3.0, 1.0],
  dense_shape=[2, 4])

Exercise 1: Build a tf.SparseTensor representation of the Rating Matrix.

In this exercise, we'll write a function that maps from our ratings DataFrame to a tf.SparseTensor.

Hint: you can select the values of a given column of a Dataframe df using df['column_name'].values.

Calculating the error

The model approximates the ratings matrix $A$ by a low-rank product $UV^\top$. We need a way to measure the approximation error. We'll start by using the Mean Squared Error of observed entries only (we will revisit this later). It is defined as

$$ \begin{align*} \text{MSE}(A, UV^\top) &= \frac{1}{|\Omega|}\sum_{(i, j) \in\Omega}{( A_{ij} - (UV^\top)_{ij})^2} \\ &= \frac{1}{|\Omega|}\sum_{(i, j) \in\Omega}{( A_{ij} - \langle U_i, V_j\rangle)^2} \end{align*} $$

where $\Omega$ is the set of observed ratings, and $|\Omega|$ is the cardinality of $\Omega$.

Exercise 2: Mean Squared Error

Write a TensorFlow function that takes a sparse rating matrix $A$ and the two embedding matrices $U, V$ and returns the mean squared error $\text{MSE}(A, UV^\top)$.

Hints:

Note: One approach is to compute the full prediction matrix $UV^\top$, then gather the entries corresponding to the observed pairs. The memory cost of this approach is $O(NM)$. For the MovieLens dataset, this is fine, as the dense $N \times M$ matrix is small enough to fit in memory ($N = 943$, $M = 1682$).

Another approach (given in the alternate solution below) is to only gather the embeddings of the observed pairs, then compute their dot products. The memory cost is $O(|\Omega| d)$ where $d$ is the embedding dimension. In our case, $|\Omega| = 10^5$, and the embedding dimension is on the order of $10$, so the memory cost of both methods is comparable. But when the number of users or movies is much larger, the first approach becomes infeasible.

Training a Matrix Factorization model

CFModel (Collaborative Filtering Model) helper class

This is a simple class to train a matrix factorization model using stochastic gradient descent.

The class constructor takes

After training, one can access the trained embeddings using the model.embeddings dictionary.

Example usage:

U_var = ...
V_var = ...
loss = ...
model = CFModel(U_var, V_var, loss)
model.train(iterations=100, learning_rate=1.0)
user_embeddings = model.embeddings['user_id']
movie_embeddings = model.embeddings['movie_id']

Exercise 3: Build a Matrix Factorization model and train it

Using your sparse_mean_square_error function, write a function that builds a CFModel by creating the embedding variables and the train and test losses.

Great, now it's time to train the model!

Go ahead and run the next cell, trying different parameters (embedding dimension, learning rate, iterations). The training and test errors are plotted at the end of training. You can inspect these values to validate the hyper-parameters.

Note: by calling model.train again, the model will continue training starting from the current values of the embeddings.

The movie and user embeddings are also displayed in the right figure. When the embedding dimension is greater than 3, the embeddings are projected on the first 3 dimensions. The next section will have a more detailed look at the embeddings.

Inspecting the Embeddings

In this section, we take a closer look at the learned embeddings, by

Exercise 4: Write a function that computes the scores of the candidates

We start by writing a function that, given a query embedding $u \in \mathbb R^d$ and item embeddings $V \in \mathbb R^{N \times d}$, computes the item scores.

As discussed in the lecture, there are different similarity measures we can use, and these can yield different results. We will compare the following:

Hints:

Equipped with this function, we can compute recommendations, where the query embedding can be either a user embedding or a movie embedding.

Movie Nearest neighbors

Let's look at the neareast neighbors for some of the movies.

It seems that the quality of learned embeddings may not be very good. Can you think of potential techniques that could be used to improve them? We can start by inspecting the embeddings.

Movie Embedding Norm

We can also observe that the recommendations with dot-product and cosine are different: with dot-product, the model tends to recommend popular movies. This can be explained by the fact that in matrix factorization models, the norm of the embedding is often correlated with popularity (popular movies have a larger norm), which makes it more likely to recommend more popular items. We can confirm this hypothesis by sorting the movies by their embedding norm, as done in the next cell.

Note: Depending on how the model is initialized, you may observe that some niche movies (ones with few ratings) have a high norm, leading to spurious recommendations. This can happen if the embedding of that movie happens to be initialized with a high norm. Then, because the movie has few ratings, it is infrequently updated, and can keep its high norm. This can be alleviated by using regularization.

Try changing the value of the hyperparameter init_stddev. One quantity that can be helpful is that the expected norm of a $d$-dimensional vector with entries $\sim \mathcal N(0, \sigma^2)$ is approximatley $\sigma \sqrt d$.

How does this affect the embedding norm distribution, and the ranking of the top-norm movies?

Embedding visualization

Since it is hard to visualize embeddings in a higher-dimensional space (when the embedding dimension $k > 3$), one approach is to project the embeddings to a lower dimensional space. T-SNE (T-distributed Stochastic Neighbor Embedding) is an algorithm that projects the embeddings while attempting to preserve their pariwise distances. It can be useful for visualization, but one should use it with care. For more information on using t-SNE, see How to Use t-SNE Effectively.

You can highlight the embeddings of a given genre by clicking on the genres panel (SHIFT+click to select multiple genres).

We can observe that the embeddings do not seem to have any notable structure, and the embeddings of a given genre are located all over the embedding space. This confirms the poor quality of the learned embeddings. One of the main reasons is that we only trained the model on observed pairs, and without regularization.

Softmax model

In this section, we will train a simple softmax model that predicts whether a given user has rated a movie.

The model will take as input a feature vector $x$ representing the list of movies the user has rated. We start from the ratings DataFrame, which we group by user_id.

We then create a function that generates an example batch, such that each example contains the following features:

Loss function

Recall that the softmax model maps the input features $x$ to a user embedding $\psi(x) \in \mathbb R^d$, where $d$ is the embedding dimension. This vector is then multiplied by a movie embedding matrix $V \in \mathbb R^{m \times d}$ (where $m$ is the number of movies), and the final output of the model is the softmax of the product $$ \hat p(x) = \text{softmax}(\psi(x) V^\top). $$ Given a target label $y$, if we denote by $p = 1_y$ a one-hot encoding of this target label, then the loss is the cross-entropy between $\hat p(x)$ and $p$.

Exercise 5: Write a loss function for the softmax model.

In this exercise, we will write a function that takes tensors representing the user embeddings $\psi(x)$, movie embeddings $V$, target label $y$, and return the cross-entropy loss.

Hint: You can use the function tf.nn.sparse_softmax_cross_entropy_with_logits, which takes logits as input, where logits refers to the product $\psi(x) V^\top$.

Exercise 6: Build a softmax model, train it, and inspect its embeddings.

We are now ready to build a softmax CFModel. Complete the build_softmax_model function in the next cell. The architecture of the model is defined in the function create_user_embeddings and illustrated in the figure below. The input embeddings (movie_id, genre and year) are concatenated to form the input layer, then we have hidden layers with dimensions specified by the hidden_dims argument. Finally, the last hidden layer is multiplied by the movie embeddings to obtain the logits layer. For the target label, we will use a randomly-sampled movie_id from the list of movies the user rated.

Softmax model

Complete the function below by creating the feature columns and embedding columns, then creating the loss tensors both for the train and test sets (using the softmax_loss function of the previous exercise).

Train the Softmax model

We are now ready to train the softmax model. You can set the following hyperparameters:

Note: since our input features are string-valued (movie_id, genre, and year), we need to map them to integer ids. This is done using tf.feature_column.categorical_column_with_vocabulary_list, which takes a vocabulary list specifying all the values the feature can take. Then each id is mapped to an embedding vector using tf.feature_column.embedding_column.

Inspect the embeddings

We can inspect the movie embeddings as we did for the previous models. Note that in this case, the movie embeddings are used at the same time as input embeddings (for the bag of words representation of the user history), and as softmax weights.